Abstract

We consider the linear contextual bandit problem with resource consumption, in addition to reward 
generation. In each round, the outcome of pulling an arm is a reward as well as a vector of resource 
consumptions. The expected values of these outcomes depend linearly on the context of that arm. The 
budget/capacity constraints require that the total consumption doesn’t exceed the budget for each
resource. The objective is once again to maximize the total reward. This problem turns out to be a
common generalization of classic linear contextual bandits (linContextual) [7, 16, 1], bandits with
knapsacks (BwK) [3, 10], and the online stochastic packing problem (OSPP) [?]. We present algorithms
with near-optimal regret bounds for this problem. Our bounds compare favorably to results on the
unstructured version of the problem [5, 11], where the relation between the contexts and the outcomes
could be arbitrary, but the algorithm only competes against a fixed set of policies accessible through 
an optimization oracle. We combine techniques from the work on linContextual, BwK and OSPP in a 
nontrivial manner while also tackling new difficulties that are not present in any of these special cases. 


1 Introduction 

In the contextual bandit problem [?], the decision maker observes a sequence of contexts (or
features). In every round she needs to pull one out of K arms, after observing the context for that round. 
The outcome of pulling an arm may be used along with the contexts to decide future arms. Contextual bandit 
problems have found many useful applications such as online recommendation systems, online advertising, 
and clinical trials, where the decision in every round needs to be customized to the features of the user being 
served. The linear contextual bandit problem [?] is a special case of the contextual bandit problem,
where the outcome is linear in the feature vector encoding the context. As pointed out by [2], contextual bandit
problems represent a natural half-way point between supervised learning and reinforcement learning: the 
use of features to encode contexts and the models for the relation between these feature vectors and the 
outcome are often inherited from supervised learning, while managing the exploration-exploitation tradeoff 
is necessary to ensure good performance in reinforcement learning. The linear contextual bandit problem can
thus be thought of as a midway point between the linear regression model of supervised learning and reinforcement
learning.

Recently, there has been a significant interest in introducing multiple “global constraints” in the standard 
bandit setting [?]. Such constraints are crucial for many important real-world applications. For
example, in clinical trials, the treatment plans may be constrained by the total availability of medical facilities, 
drugs and other resources. In online advertising, there are budget constraints that restrict the number of 
times an ad is shown. Other applications include dynamic pricing, dynamic procurement, crowdsourcing, 
etc.; see [?] for many such examples.

In this paper, we consider the linear contextual bandit with knapsacks (henceforth, linCBwK) problem.
In this problem, the context vectors are generated i.i.d. in every round from some unknown distribution,
and on picking an arm, a reward and a consumption vector are observed, which depend linearly on the context
vector. The aim of the decision maker is to maximize the total reward while ensuring that the total consumption
of every resource remains within a given budget. Below, we give a more precise definition of this problem.

We use the following notational convention throughout: vectors are denoted by bold face lower case letters,
while matrices are denoted by regular face upper case letters. Other quantities such as sets, scalars, etc.
may be of either case, but never bold faced. All vectors are column vectors, i.e., a vector in $n$ dimensions is
treated as an $n \times 1$ matrix. The transpose of matrix $A$ is $A^\top$.

Definition 1 (linCBwK). There are $K$ “arms”, which we identify with the set $[K]$. The algorithm is initially
given as input a budget $B \in \mathbb{R}_+$. In every round $t$, the algorithm first observes context $\mathbf{x}_t(a) \in [0,1]^m$ for
every arm $a$, then chooses an arm $a_t \in [K]$, and finally observes a reward $r_t(a_t) \in [0,1]$ and a $d$-dimensional
consumption vector $\mathbf{v}_t(a_t) \in [0,1]^d$. The algorithm has a “no-op” option, which is to pick none
of the arms and get 0 reward and 0 consumption. The goal of the algorithm is to pick arms such that the
total reward $\sum_{t=1}^{T} r_t(a_t)$ is maximized, while ensuring that the total consumption does not exceed the budget, i.e.,
$\sum_t \mathbf{v}_t(a_t) \le B\mathbf{1}$.

We make the following stochastic assumption for context, reward, and consumption vectors. In every round
$t$, the tuple $\{\mathbf{x}_t(a), r_t(a), \mathbf{v}_t(a)\}_{a=1}^{K}$ is generated from an unknown distribution $\mathcal{D}$, independent of everything
in previous rounds. Also, there exists an unknown vector $\boldsymbol{\mu}_* \in [0,1]^m$ and matrix $W_* \in [0,1]^{m \times d}$ such that
for every arm $a$, given contexts $\mathbf{x}_t(a)$ and history $H_{t-1}$ before time $t$,

$$\mathbb{E}[r_t(a) \mid \mathbf{x}_t(a), H_{t-1}] = \boldsymbol{\mu}_*^\top \mathbf{x}_t(a), \qquad \mathbb{E}[\mathbf{v}_t(a) \mid \mathbf{x}_t(a), H_{t-1}] = W_*^\top \mathbf{x}_t(a). \tag{1}$$

For succinctness, we will denote the tuple of contexts for $K$ arms at time $t$ as matrix $X_t \in [0,1]^{m \times K}$, with
$\mathbf{x}_t(a)$ being the $a$-th column of this matrix. Similarly, rewards are represented as vector $\mathbf{r}_t \in [0,1]^K$, and
consumption vectors are represented as matrix $V_t \in [0,1]^{d \times K}$.
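
To make the stochastic model concrete, one round of it can be simulated as below. This is only an illustrative sketch: the function name `sample_round`, the uniform context distribution, the scaling of contexts into $[0, 1/m]^m$, and the Bernoulli outcomes are all assumptions made here so that the linear means in (1) are valid probabilities; the text only requires that the conditional expectations be linear in the context.

```python
import random

def sample_round(mu_star, W_star, num_arms, rng):
    """Draw one round (contexts, rewards, consumptions) under the linear model (1)."""
    m, d = len(mu_star), len(W_star[0])
    contexts, rewards, consumptions = [], [], []
    for _ in range(num_arms):
        # Assumption: contexts lie in [0, 1/m]^m so every linear mean below is in [0, 1].
        x = [rng.random() / m for _ in range(m)]
        mean_r = sum(mu_star[i] * x[i] for i in range(m))        # mu_*^T x
        mean_v = [sum(W_star[i][j] * x[i] for i in range(m))     # j-th entry of W_*^T x
                  for j in range(d)]
        contexts.append(x)
        rewards.append(1.0 if rng.random() < mean_r else 0.0)    # Bernoulli(mean_r)
        consumptions.append([1.0 if rng.random() < p else 0.0 for p in mean_v])
    return contexts, rewards, consumptions

rng = random.Random(0)
mu_star = [0.9, 0.2, 0.5]                       # hypothetical parameters, m = 3
W_star = [[0.3, 0.0], [0.1, 0.8], [0.6, 0.4]]   # m x d with d = 2
X, r, V = sample_round(mu_star, W_star, num_arms=4, rng=rng)
```

Here `X[a]` plays the role of $\mathbf{x}_t(a)$, `r[a]` of $r_t(a)$, and `V[a]` of $\mathbf{v}_t(a)$; an algorithm would only observe `r[a]` and `V[a]` for the arm it pulls.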

As we discuss later in the text, the assumption in equation (1) forms the primary distinction between
our linear contextual bandit setting and the general contextual bandit setting considered in [5]. Exploiting
this linearity assumption will allow us to generate regret bounds which do not depend on the number of arms $K$,
rendering it especially useful when the number of arms is large. Some examples include recommendation
systems with a large number of products (e.g., retail products, travel packages, ad creatives, sponsored Facebook
posts). Another advantage over using the general contextual bandit setting of [5] is that we don’t need oracle
access to a certain optimization problem, which in this case would require solving an NP-Hard problem. (See
Section 1.1 for a more detailed discussion.)

We compare the performance of an algorithm to that of the optimal adaptive policy that knows the
distribution $\mathcal{D}$ and the parameters $(\boldsymbol{\mu}_*, W_*)$, and can take into account the history up to that point as well
as the current context to decide (possibly with randomization) which arm to pull at time $t$. However, it is
easier to work with an upper bound on this, which is the optimal expected reward of a static policy that
is required to satisfy the constraints only in expectation. This technique has been used in several related
problems and is standard by now [?].

Definition 2 (Optimal Static Policy). Consider any policy that is context dependent but non-adaptive: for
a policy $\pi$, let $\pi(X) \in \Delta^{K+1}$ (the unit simplex) denote the probability distribution over arms played (plus
no-op) when the context is $X \in [0,1]^{m \times K}$. Define $r(\pi)$ and $\mathbf{v}(\pi)$ to be the expected reward and consumption vector
of policy $\pi$, respectively, i.e.,

$$r(\pi) := \mathbb{E}_{(X, \mathbf{r}, V) \sim \mathcal{D}}[\mathbf{r}^\top \pi(X)] = \mathbb{E}_{X \sim \mathcal{D}}[\boldsymbol{\mu}_*^\top X \pi(X)], \tag{2}$$

$$\mathbf{v}(\pi) := \mathbb{E}_{(X, \mathbf{r}, V) \sim \mathcal{D}}[V \pi(X)] = \mathbb{E}_{X \sim \mathcal{D}}[W_*^\top X \pi(X)]. \tag{3}$$

Let

$$\pi^* := \arg\max_{\pi} \; T\, r(\pi) \quad \text{such that} \quad T\, \mathbf{v}(\pi) \le B\mathbf{1} \tag{4}$$

be the optimal static policy. Note that since no-op is allowed, a feasible policy always exists. We denote the
value of this optimal static policy by $\mathrm{OPT} := T\, r(\pi^*)$.

The following lemma proves that OPT upper bounds the value of the optimal adaptive policy. The proof is in
Appendix B.

Lemma 1. Let $\overline{\mathrm{OPT}}$ denote the value of the optimal adaptive policy that knows the distribution $\mathcal{D}$ and parameters
$\boldsymbol{\mu}_*, W_*$. Then there exists a static policy $\pi^*$ such that $T\, r(\pi^*) \ge \overline{\mathrm{OPT}}$ and $T\, \mathbf{v}(\pi^*) \le B\mathbf{1}$.

Definition 3 (Regret). Let $a_t$ be the arm played at time $t$ by the algorithm. Then, regret is defined as

$$\mathrm{regret}(T) := \mathrm{OPT} - \sum_{t=1}^{T} r_t(a_t).$$


1.1 Main results 

Our main result is an algorithm with near-optimal regret bound for linCBwK.

Theorem 1. There is an algorithm for linCBwK such that if $B > mT^{3/4}$, then with probability at least
$1 - \delta$,

$$\mathrm{regret}(T) = O\left(\left(\frac{\mathrm{OPT}}{B} + 1\right) m \sqrt{T \ln(dT/\delta) \ln(T)}\right).$$

Relation to general contextual bandits. There have been recent papers [5, 11] that solve problems
similar to linCBwK but for general contextual bandits. Here the relation between contexts and outcome
vectors is arbitrary and the algorithms compete with an arbitrary fixed set of context dependent policies $\Pi$
accessible via an optimization oracle, with regret bounds being $O\left(\left(\frac{\mathrm{OPT}}{B} + 1\right) \sqrt{KT \log(dT|\Pi|/\delta)}\right)$. These
approaches could potentially be applied to the linear setting using a set $\Pi$ of linear context dependent policies.
Comparing their bounds with ours, in our results, essentially a $\sqrt{K \log(|\Pi|)}$ factor is replaced by a factor
of $m$. Most importantly, we have no dependence on $K$, which enables us to consider problems with large
action spaces. In any case, both $K$ and $\log(|\Pi|)$ are at least $m$, so their bounds are no smaller.

Further, suppose that we want to use their result with the set of linear policies, i.e., policies of the form

$$\arg\max_{a \in [K]} \{\mathbf{x}_t(a)^\top \boldsymbol{\theta}\},$$

for some fixed $\boldsymbol{\theta} \in \mathbb{R}^m$. Then, their algorithms would require access to an “Arg-Max Oracle” that can find the
best such policy (maximizing total reward) for a given set of contexts and rewards (no resource consumption).
We show that in fact the optimization problem underlying such an “Arg-Max Oracle” is NP-Hard,
making such an approach computationally expensive. (The proof is in Appendix C.)

The only downside to our results is that we need the budget $B$ to be $\Omega(mT^{3/4})$. Getting similar bounds
for budgets as small as $B = \Omega(m\sqrt{T})$ is an interesting open problem. (This also indicates that this is indeed
a harder problem than all the special cases.)

Near-optimality of regret bounds. In [?], it was shown that for the linear contextual bandits problem,
no online algorithm can achieve a regret bound better than $\Omega(m\sqrt{T})$. In fact, they prove this lower bound for
linear contextual bandits with static contexts. Since that problem is a special case of the linCBwK problem
with $d = 1$, this shows that the dependence on $m$ and $T$ in the above regret bound is optimal up to log factors.
For general contextual bandits with resource constraints, the bounds of [?] are near optimal.

Relation to BwK [3] and OSPP [?]. It is easy to see that the linCBwK problem is a generalization
of the linear contextual bandits problem [?]. There, the outcome is scalar and the goal is to simply
maximize the sum of these. Remarkably, the linCBwK problem also turns out to be a common generalization
of the bandits with knapsacks (BwK) problem considered in [?] and the online stochastic packing problem
(OSPP) studied by [?]. In both BwK and OSPP, the outcome of every round $t$ is a reward $r_t$
and a vector $\mathbf{v}_t$, and the goal of the algorithm is to maximize $\sum_{t=1}^{T} r_t$ while ensuring that $\sum_{t=1}^{T} \mathbf{v}_t \le B\mathbf{1}$.
The problems differ in how these rewards and vectors are picked. In the OSPP problem, in every round $t$,
the algorithm may pick any (reward, vector) pair from a given set $A_t$ of $(d+1)$-dimensional vectors. The set
$A_t$ is drawn i.i.d. from an unknown distribution over sets of vectors. This corresponds to the special case
of linCBwK where $m = d + 1$ and the context $\mathbf{x}_t(a)$ itself is equal to $(r_t(a), \mathbf{v}_t(a))$. In the BwK problem,
there is a fixed set of arms, and for each arm there is an unknown distribution over (reward, vector) pairs.
The algorithm picks an arm and a (reward, vector) pair is drawn from the corresponding distribution for that
arm. This corresponds to the special case of linCBwK where $m = K$ and the context $X_t = I$, the identity
matrix, for all $t$.

1 Similar to the regret bounds for linear contextual bandits [?].
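
The two special cases can be checked directly against the linear model (1) by writing out parameters that realize them. The following is a sketch derived here, not taken from the text; the symbols $\mathbf{e}_1$, $\mathbf{e}_a$ (standard basis vectors), $\bar{r}(a)$ and $\bar{\mathbf{v}}(a)$ (mean reward and mean consumption of arm $a$) are notation assumed for illustration.

```latex
% OSPP: m = d + 1 and x_t(a) = (r_t(a), v_t(a)).
\[
\boldsymbol{\mu}_* = \mathbf{e}_1 \in \mathbb{R}^{d+1}, \qquad
W_* = \begin{pmatrix} \mathbf{0}_{1 \times d} \\ I_d \end{pmatrix} \in \mathbb{R}^{(d+1) \times d}
\;\Longrightarrow\;
\boldsymbol{\mu}_*^\top \mathbf{x}_t(a) = r_t(a), \quad W_*^\top \mathbf{x}_t(a) = \mathbf{v}_t(a).
\]
% BwK: m = K and X_t = I, so x_t(a) = e_a.
\[
(\boldsymbol{\mu}_*)_a = \bar{r}(a), \qquad (W_*)_{a,\cdot} = \bar{\mathbf{v}}(a)^\top
\;\Longrightarrow\;
\boldsymbol{\mu}_*^\top \mathbf{e}_a = \bar{r}(a), \quad W_*^\top \mathbf{e}_a = \bar{\mathbf{v}}(a).
\]
```

In the OSPP case the outcome is a deterministic function of the context, so all the randomness sits in the contexts; in the BwK case the contexts are fixed and all the randomness sits in the outcomes.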

We use techniques from all three special cases: our algorithms follow the primal-dual paradigm using
an online learning algorithm to search the dual space, which was established in [?]. In order to deal with
linear contexts, we use techniques from [?] to estimate the weight matrix $W_*$, and define “optimistic
estimates” of $W_*$. We also use the technique of combining the objective and the constraints using a certain
tradeoff parameter, which was introduced in [3]. Further new difficulties arise, such as in estimating the
optimum value from the first few rounds, a task that follows from standard techniques in each of the special
cases but is very challenging here. We develop a new way of exploration that uses the linear structure, so that
one can evaluate all possible choices that could have led to an optimum solution on the historic sample. This
technique might be of independent interest in estimating optimum values. One can see that the problem is
indeed more than the sum of its parts, from the fact that we get the optimal bound for linCBwK only when
$B > mT^{3/4}$, unlike either special case for which the optimal bound holds for all $B$ (but is meaningful
only for $B = \Omega(m\sqrt{T})$).

The approach in [3] (for BwK) extends to the case of “static” contexts,$^2$ where each arm has a context
that doesn’t change over time. The OSPP of [4] is not a special case of linCBwK with static contexts; this
is one indication of the additional difficulty of dynamic over static contexts.

Other related work. Budget constraints in a bandit setting have received considerable attention, but most
of the early work focused on special cases such as a single budget constraint in the regular (non-contextual)
setting [20, ?, 26, 29, 35, 36]. Recently, [38] showed an $O(\sqrt{T})$ regret in the linear contextual setting with
a single budget constraint, when costs depend only on contexts and not arms. Budget constraints that
arise in particular applications such as online advertising [?], dynamic pricing [?], and crowdsourcing
[9, 33, ?] have also been considered. There has also been a long line of work studying special cases of the
OSCP problem [?].

Due to space constraints, we have eliminated many proofs from the main text. All the missing proofs are 
in the appendix. 


2 Preliminaries 

2.1 Confidence Ellipsoid 

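Consider a stochastic process which in each round $t$ generates a pair of observations $(r_t, \mathbf{y}_t)$, such that $r_t$ is
an unknown linear function of $\mathbf{y}_t$ plus some 0-mean bounded noise, i.e., $r_t = \boldsymbol{\mu}_*^\top \mathbf{y}_t + \eta_t$, where $\mathbf{y}_t, \boldsymbol{\mu}_* \in \mathbb{R}^m$,
$|\eta_t| \le 2R$, and $\mathbb{E}[\eta_t \mid \mathbf{y}_1, r_1, \ldots, \mathbf{y}_{t-1}, r_{t-1}, \mathbf{y}_t] = 0$.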

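At any time $t$, a high confidence estimate of the unknown vector $\boldsymbol{\mu}_*$ can be obtained by building a
“Confidence Ellipsoid” around the $\ell_2$-regularized least-squares estimate $\hat{\boldsymbol{\mu}}_t$ constructed from the observations
made so far. This technique is common in prior work on linear contextual bandits (e.g., in [?]). For
any regularization parameter $\lambda > 0$, let

$$M_t := \lambda I + \sum_{i=1}^{t-1} \mathbf{y}_i \mathbf{y}_i^\top, \quad \text{and} \quad \hat{\boldsymbol{\mu}}_t := M_t^{-1} \sum_{i=1}^{t-1} \mathbf{y}_i r_i.$$

The following result from [1] shows that $\boldsymbol{\mu}_*$ lies with high probability in an ellipsoid with center $\hat{\boldsymbol{\mu}}_t$. For any
positive semi-definite (PSD) matrix $M$, define the $M$-norm as $\|\boldsymbol{\mu}\|_M := \sqrt{\boldsymbol{\mu}^\top M \boldsymbol{\mu}}$. The confidence ellipsoid
at time $t$ is defined as

$$C_t := \left\{ \boldsymbol{\mu} \in \mathbb{R}^m : \|\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}_t\|_{M_t} \le R \sqrt{m \ln\left((1 + tm/\lambda)/\delta\right)} + \sqrt{\lambda m} \right\}.$$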

2 It was incorrectly claimed in [3] that the approach can be extended to dynamic contexts without much modification.

Lemma 2 (Theorem 2 of [1]). If $\forall t$, $\|\boldsymbol{\mu}_*\|_2 \le \sqrt{m}$ and $\|\mathbf{y}_t\|_2 \le \sqrt{m}$, then with prob. $1 - \delta$, $\boldsymbol{\mu}_* \in C_t$.

Another useful observation about this construction is stated below. It first appeared as Lemma 11 of [7],
and was also proved as Lemma 3 in [?].

Lemma 3 (Lemma 11 of [7]). $\sum_{t=1}^{T} \|\mathbf{y}_t\|_{M_t^{-1}} \le \sqrt{mT \ln(T)}$.

As a corollary of the above two lemmas, we obtain a bound on the total error in the estimate provided
by “any point” from the confidence ellipsoid. (The proof is in Appendix D.)

Corollary 1. For $t = 1, \ldots, T$, let $\tilde{\boldsymbol{\mu}}_t \in C_t$ be a point in the confidence ellipsoid, with $\lambda = 1$, $2R = 1$. Then,
with probability $1 - \delta$,

$$\sum_{t=1}^{T} |\tilde{\boldsymbol{\mu}}_t^\top \mathbf{y}_t - \boldsymbol{\mu}_*^\top \mathbf{y}_t| \le 2m \sqrt{T \ln\left((1 + Tm)/\delta\right) \ln(T)}.$$
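
The quantities in this section are cheap to maintain incrementally. The sketch below keeps $M_t^{-1}$ up to date with the Sherman-Morrison rank-one formula, which is a standard identity but is an implementation choice assumed here rather than something the text prescribes. All function names (`make_state`, `update`, `estimate`, `radius`, `inv_norm`) are hypothetical.

```python
import math

def make_state(m, lam=1.0):
    """Before any observation: M^{-1} = I / lam and b = sum_i y_i r_i = 0."""
    Minv = [[(1.0 / lam if i == j else 0.0) for j in range(m)] for i in range(m)]
    return Minv, [0.0] * m

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def update(Minv, b, y, r):
    """Account for M <- M + y y^T (via Sherman-Morrison) and b <- b + r y."""
    My = matvec(Minv, y)                                  # M^{-1} y (M^{-1} is symmetric)
    denom = 1.0 + sum(yi * v for yi, v in zip(y, My))     # 1 + y^T M^{-1} y
    m = len(y)
    Minv = [[Minv[i][j] - My[i] * My[j] / denom for j in range(m)] for i in range(m)]
    b = [bi + r * yi for bi, yi in zip(b, y)]
    return Minv, b

def estimate(Minv, b):
    """Regularized least-squares estimate M^{-1} b."""
    return matvec(Minv, b)

def radius(m, t, delta, R=0.5, lam=1.0):
    """Radius of the confidence ellipsoid C_t as defined in the text."""
    return R * math.sqrt(m * math.log((1.0 + t * m / lam) / delta)) + math.sqrt(lam * m)

def inv_norm(Minv, y):
    """||y||_{M^{-1}} = sqrt(y^T M^{-1} y), the quantity summed in Lemma 3."""
    return math.sqrt(sum(yi * v for yi, v in zip(y, matvec(Minv, y))))

Minv, b = make_state(2)
Minv, b = update(Minv, b, [1.0, 0.0], 1.0)
mu_hat = estimate(Minv, b)
```

After the single observation above, $M = \mathrm{diag}(2, 1)$, so the estimate is shrunk toward zero in the observed direction, and `inv_norm` in that direction drops from 1 to $\sqrt{1/2}$.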


2.2 Online Learning 

The online convex optimization (OCO) problem considers a $T$-round game played between a learner and an
adversary, where in round $t$, the learner chooses a $\boldsymbol{\theta}_t \in \Omega$, and then the adversary picks a concave function
$g_t(\boldsymbol{\theta}_t) : \Omega \to \mathbb{R}$. The learner’s choice $\boldsymbol{\theta}_t$ may only depend on the learner’s and adversary’s choices in previous
rounds. The goal of the learner is to minimize regret, defined as the difference between the learner’s objective
value and the value of the best single choice in hindsight:

$$\mathcal{R}(T) := \max_{\boldsymbol{\theta} \in \Omega} \sum_{t=1}^{T} g_t(\boldsymbol{\theta}) - \sum_{t=1}^{T} g_t(\boldsymbol{\theta}_t).$$

In particular, we will use linear reward functions with values in $[-1, 1]$, and the domain $\Omega$ is the unit simplex
in $d + 1$ dimensions. The online mirror descent (OMD) algorithm has very fast per-step update rules, and
provides the following regret guarantee for this setting.

Lemma 4. [32] The online mirror descent algorithm for the OCO problem achieves regret

$$\mathcal{R}(T) = O\left(\sqrt{\log(d)\, T}\right).$$

We actually need the domain to be

$$\Omega = \{\boldsymbol{\theta} : \|\boldsymbol{\theta}\|_1 \le 1, \boldsymbol{\theta} \ge 0\}.$$

This is a special case of a unit simplex in $d + 1$ dimensions, by letting the rewards on one of the dimensions
always be zero. For the rest of the paper, we assume that the OMD algorithm is using this domain.
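
One common instantiation of OMD on the simplex uses the entropic regularizer, which gives the multiplicative-weights (exponentiated gradient) update. The sketch below assumes that instantiation; the text does not say which mirror map is used. The class name `SimplexOMD` and the step size `eta` are hypothetical. The extra zero-reward coordinate realizes the domain $\{\boldsymbol{\theta} : \|\boldsymbol{\theta}\|_1 \le 1, \boldsymbol{\theta} \ge 0\}$ exactly as described above.

```python
import math

class SimplexOMD:
    """Exponentiated-gradient sketch on {theta >= 0, ||theta||_1 <= 1}."""

    def __init__(self, d, eta):
        self.eta = eta
        # d real coordinates plus one slack coordinate whose reward is always 0.
        self.weights = [1.0 / (d + 1)] * (d + 1)

    def theta(self):
        """Current play: the first d coordinates of the point on the (d+1)-simplex."""
        return self.weights[:-1]

    def update(self, gains):
        """gains: list of d linear rewards in [-1, 1]; larger gain means more weight."""
        scaled = [w * math.exp(self.eta * g)
                  for w, g in zip(self.weights, list(gains) + [0.0])]
        total = sum(scaled)
        self.weights = [w / total for w in scaled]

omd = SimplexOMD(d=2, eta=0.5)
omd.update([1.0, -1.0])
theta = omd.theta()
```

A step size on the order of $\sqrt{\ln(d+1)/T}$ is the usual choice for this update, stated here as a general fact about multiplicative weights rather than as the paper's setting.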

3 Algorithm 

3.1 Optimistic estimates of unknown parameters 

Let $a_t$ denote the arm played by the algorithm at time $t$. In the beginning of every round, we use the
outcomes and contexts from previous rounds to construct a confidence ellipsoid for $\boldsymbol{\mu}_*$ and every column of
$W_*$. The construction of the confidence ellipsoid for $\boldsymbol{\mu}_*$ follows directly from the techniques in Section 2.1 with
$\mathbf{y}_t = \mathbf{x}_t(a_t)$ and $r_t$ being the reward at time $t$. To construct a confidence ellipsoid for a column $j$ of $W_*$, we use
the techniques in Section 2.1 while substituting $\mathbf{y}_t = \mathbf{x}_t(a_t)$ and $r_t = \mathbf{v}_t(a_t)_j$ for every $j$.

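As in Section 2.1, let $M_t := I + \sum_{i=1}^{t-1} \mathbf{x}_i(a_i) \mathbf{x}_i(a_i)^\top$, and construct the regularized least squares estimates
for $\boldsymbol{\mu}_*$ and $W_*$, respectively, as

$$\hat{\boldsymbol{\mu}}_t := M_t^{-1} \sum_{i=1}^{t-1} \mathbf{x}_i(a_i)\, r_i(a_i), \tag{5}$$

$$\hat{W}_t := M_t^{-1} \sum_{i=1}^{t-1} \mathbf{x}_i(a_i)\, \mathbf{v}_i(a_i)^\top. \tag{6}$$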





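Define the confidence ellipsoid for parameter $\boldsymbol{\mu}_*$ as

$$C_{t,0} := \left\{ \boldsymbol{\mu} \in \mathbb{R}^m : \|\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}_t\|_{M_t} \le \sqrt{m \ln\left((d + tmd)/\delta\right)} + \sqrt{m} \right\},$$

and the optimistic estimate of $\boldsymbol{\mu}_*$ for every arm $a$ as

$$\tilde{\boldsymbol{\mu}}_t(a) := \arg\max_{\boldsymbol{\mu} \in C_{t,0}} \mathbf{x}_t(a)^\top \boldsymbol{\mu}. \tag{7}$$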

Let $\mathbf{w}_j$ denote the $j$-th column of a matrix $W$. We define a confidence ellipsoid for each column $j$ as

$$C_{t,j} := \left\{ \mathbf{w} \in \mathbb{R}^m : \|\mathbf{w} - \hat{\mathbf{w}}_{tj}\|_{M_t} \le \sqrt{m \ln\left((d + tmd)/\delta\right)} + \sqrt{m} \right\},$$

and denote by $\mathcal{G}_t$ the Cartesian product of all these ellipsoids: $\mathcal{G}_t := \{W \in \mathbb{R}^{m \times d} : \mathbf{w}_j \in C_{t,j}\}$. Note that
Lemma 2 implies $W_* \in \mathcal{G}_t$ with probability $1 - \delta$. Now, given a vector $\boldsymbol{\theta}_t \in \mathbb{R}^d$, we define the optimistic
estimate of the weight matrix at time $t$ w.r.t. $\boldsymbol{\theta}_t$, for every arm $a \in [K]$, as

$$\tilde{W}_t(a) := \arg\min_{W \in \mathcal{G}_t} \mathbf{x}_t(a)^\top W \boldsymbol{\theta}_t. \tag{8}$$

Intuitively, for reward we want an upper confidence bound and for consumption we want a lower confidence
bound as an optimistic estimate. This intuition aligns with the above definitions, where the maximizer was
used in case of reward and a minimizer was used for consumption. The utility and precise meaning of $\boldsymbol{\theta}_t$
will become clearer when we describe the algorithm and present the regret analysis.
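
Both optimization problems (7) and (8) have closed-form values, which is what makes the optimistic estimates practical to compute. Over an ellipsoid $\{\boldsymbol{\mu} : \|\boldsymbol{\mu} - \hat{\boldsymbol{\mu}}\|_M \le \rho\}$, the maximum of $\mathbf{x}^\top \boldsymbol{\mu}$ is $\mathbf{x}^\top \hat{\boldsymbol{\mu}} + \rho \|\mathbf{x}\|_{M^{-1}}$, and for $\boldsymbol{\theta} \ge 0$ the minimum in (8) decouples over columns. The sketch below computes these values; it is derived here from the definitions rather than quoted from the text, and the function names and the radius argument `rho` are hypothetical.

```python
import math

def inv_norm(Minv, x):
    """||x||_{M^{-1}} = sqrt(x^T M^{-1} x)."""
    Mx = [sum(a * xi for a, xi in zip(row, x)) for row in Minv]
    return math.sqrt(sum(xi * v for xi, v in zip(x, Mx)))

def optimistic_reward(x, mu_hat, Minv, rho):
    """Value of max over the ellipsoid of x^T mu: an upper confidence bound."""
    return sum(xi * mi for xi, mi in zip(x, mu_hat)) + rho * inv_norm(Minv, x)

def optimistic_consumption(x, W_hat, Minv, rho, theta):
    """Value of min over the product of column ellipsoids of x^T W theta.

    Assumes theta >= 0 coordinate-wise, so each column can be minimized separately.
    """
    width = rho * inv_norm(Minv, x)
    total = 0.0
    for j, theta_j in enumerate(theta):
        x_dot_wj = sum(x[i] * W_hat[i][j] for i in range(len(x)))
        total += theta_j * (x_dot_wj - width)     # lower confidence bound for column j
    return total

identity = [[1.0, 0.0], [0.0, 1.0]]
ucb = optimistic_reward([3.0, 4.0], [0.0, 0.0], identity, rho=2.0)
lcb = optimistic_consumption([3.0, 4.0], identity, identity, rho=0.1, theta=[0.5, 0.5])
```

The adjusted score used by the algorithm only needs these two scalars per arm, so the full matrices $\tilde{W}_t(a)$ never have to be materialized.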

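Using the definition of $\tilde{\boldsymbol{\mu}}_t, \tilde{W}_t$, along with the results in Lemma 2 and Corollary 1 about confidence
ellipsoids, the following can be derived.

Corollary 2. With probability $1 - \delta$, for any sequence of $\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_T$,

1. $\mathbf{x}_t(a)^\top \tilde{\boldsymbol{\mu}}_t(a) \ge \mathbf{x}_t(a)^\top \boldsymbol{\mu}_*$, for all arms $a \in [K]$, for all time $t$.

2. $\mathbf{x}_t(a)^\top \tilde{W}_t(a) \boldsymbol{\theta}_t \le \mathbf{x}_t(a)^\top W_* \boldsymbol{\theta}_t$, for all arms $a \in [K]$, for all time $t$.

3. $\left|\sum_{t=1}^{T} (\tilde{\boldsymbol{\mu}}_t(a_t) - \boldsymbol{\mu}_*)^\top \mathbf{x}_t(a_t)\right| \le 2m \sqrt{T \ln\left((d + Tmd)/\delta\right) \ln(T)}$.

4. $\left\|\sum_{t=1}^{T} (\tilde{W}_t(a_t) - W_*)^\top \mathbf{x}_t(a_t)\right\| \le \|\mathbf{1}_d\| \left(2m \sqrt{T \ln\left((d + Tmd)/\delta\right) \ln(T)}\right)$.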

Essentially, the first two claims ensure that we have optimistic estimates, and the last two claims ensure 
that the estimates quickly converge to the true parameters. 

3.2 The core algorithm 

In this section, we present an algorithm, and its analysis, under the assumption that a certain parameter $Z$ is
given. Later, we show how to use the first $T_0$ rounds to estimate $Z$, and also bound the additional regret
due to these $T_0$ rounds. We define $Z$ now.

Assumption 1. Assume we are given $Z$ such that $\frac{\mathrm{OPT}}{B} \le Z \le O\left(\frac{\mathrm{OPT}}{B} + 1\right)$.

The algorithm constructs estimates $\hat{\boldsymbol{\mu}}_t$ and $\hat{W}_t$ as in Section 3.1. It also runs the OMD algorithm for
an instance of the online learning problem, over the unit simplex. The vector played by the online learning
algorithm in time step $t$ is $\boldsymbol{\theta}_t$. After observing the context, the optimistic estimates for each arm are then
constructed using $\boldsymbol{\theta}_t$, as defined in (7) and (8). Intuitively, $\boldsymbol{\theta}_t$ is used here as a multiplier to combine different
columns of the weight matrix, to get an optimistic weight vector for every arm. An adjusted estimated reward
for arm $a$ is then defined by using $Z$ to combine the optimistic estimate of reward with the optimistic estimate of
consumption, as $(\mathbf{x}_t(a)^\top \tilde{\boldsymbol{\mu}}_t(a)) - Z (\mathbf{x}_t(a)^\top \tilde{W}_t(a) \boldsymbol{\theta}_t)$. The algorithm chooses the arm which appears to be the
best according to the adjusted estimated reward. After observing the resulting reward and consumption vectors,
the estimates are updated. The online learning algorithm is advanced by one step, by defining the profit
vector to be $\mathbf{v}_t(a_t) - \frac{B}{T}\mathbf{1}$. The algorithm ends either after $T$ time steps or as soon as the total consumption
exceeds the budget along some dimension.
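Algorithm 1 Algorithm for linCBwK, with given Z

Initialize $\boldsymbol{\theta}_1$ as per the OCO algorithm.
Initialize $Z$ such that $\frac{\mathrm{OPT}}{B} \le Z \le O\left(\frac{\mathrm{OPT}}{B} + 1\right)$.
for all $t = 1, \ldots, T$ do
    Observe $X_t$.
    For every $a \in [K]$, compute $\tilde{\boldsymbol{\mu}}_t(a)$ and $\tilde{W}_t(a)$ as per (7) and (8) respectively.
    Play the arm $a_t := \arg\max_{a \in [K]} \mathbf{x}_t(a)^\top (\tilde{\boldsymbol{\mu}}_t(a) - Z \tilde{W}_t(a) \boldsymbol{\theta}_t)$.
    Observe $r_t(a_t)$ and $\mathbf{v}_t(a_t)$.
    If for some $j = 1, \ldots, d$, $\sum_{t' \le t} \mathbf{v}_{t'}(a_{t'}) \cdot \mathbf{e}_j \ge B$ then EXIT.
    Use $\mathbf{x}_t(a_t)$, $r_t(a_t)$ and $\mathbf{v}_t(a_t)$ to obtain $\hat{\boldsymbol{\mu}}_{t+1}$, $\hat{W}_{t+1}$ and $\mathcal{G}_{t+1}$.
    Update $\boldsymbol{\theta}_{t+1}$ as per the OCO algorithm with $g_t(\boldsymbol{\theta}_t) := \boldsymbol{\theta}_t \cdot \left(\mathbf{v}_t(a_t) - \frac{B}{T}\mathbf{1}\right)$.
end for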
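
The per-round decisions of this algorithm reduce to a few scalar operations once the optimistic values are available. The sketch below isolates those three pieces: the arm choice by adjusted estimated reward, the profit vector handed to the online learner, and the stopping test. It assumes that `opt_reward[a]` holds $\mathbf{x}_t(a)^\top \tilde{\boldsymbol{\mu}}_t(a)$ and `opt_cons[a]` holds $\mathbf{x}_t(a)^\top \tilde{W}_t(a) \boldsymbol{\theta}_t$; the helper names are hypothetical.

```python
def choose_arm(opt_reward, opt_cons, Z):
    """Index of the arm maximizing optimistic reward minus Z times optimistic consumption."""
    scores = [r - Z * c for r, c in zip(opt_reward, opt_cons)]
    return max(range(len(scores)), key=scores.__getitem__)

def omd_gain(v, B, T):
    """Profit vector v_t(a_t) - (B/T) 1 passed to the online learning algorithm."""
    return [vj - B / T for vj in v]

def budget_exhausted(total_consumption, B):
    """True as soon as some resource has consumed at least the budget B."""
    return any(c >= B for c in total_consumption)

arm_with_penalty = choose_arm([0.9, 0.5], [0.8, 0.1], Z=1.0)
arm_without_penalty = choose_arm([0.9, 0.5], [0.8, 0.1], Z=0.0)
gain = omd_gain([1.0, 0.0], B=50.0, T=100)
```

With $Z = 0$ the rule is plain optimism on reward; a larger $Z$ shifts the choice toward arms that are cheap on the resources that $\boldsymbol{\theta}_t$ currently weights most. A resource consumed faster than the average rate $B/T$ yields a positive gain coordinate, which is what pushes its multiplier up.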


Theorem 2. Given a $Z$ as per Assumption 1, Algorithm 1 achieves the following bounds, given that $\mathcal{R}(T)$
is the regret of the OCO algorithm, with probability $1 - \delta$:

$$\mathrm{regret}(T) \le O\left(\left(\frac{\mathrm{OPT}}{B} + 1\right) m \sqrt{T \ln(dT/\delta) \ln(T)}\right).$$

(Proof Sketch) We provide a sketch of the proof here, with the full proof in Appendix E. Let $\tau$ be the stopping
time of the algorithm. The proof is in 3 steps:

Step 1: Since $\mathbb{E}[\mathbf{v}_t(a_t) \mid X_t, a_t, H_{t-1}] = W_*^\top \mathbf{x}_t(a_t)$, we apply Azuma-Hoeffding to get that with high
probability $\left\|\sum_{t=1}^{\tau} \mathbf{v}_t(a_t) - W_*^\top \mathbf{x}_t(a_t)\right\|_\infty$ is small. It therefore suffices to bound the sum of the vectors
$W_*^\top \mathbf{x}_t(a_t)$. Similarly, a lower bound on the sum of $\boldsymbol{\mu}_*^\top \mathbf{x}_t(a_t)$ is sufficient.

Step 2: From Corollary 2, with high probability, we can bound $\left\|\sum_{t=1}^{\tau} (W_* - \tilde{W}_t(a_t))^\top \mathbf{x}_t(a_t)\right\|_\infty$. It is
therefore sufficient to work with the sum of the vectors $\tilde{W}_t(a_t)^\top \mathbf{x}_t(a_t)$, and similarly $\tilde{\boldsymbol{\mu}}_t(a_t)^\top \mathbf{x}_t(a_t)$.

Step 3: The proof is completed by showing the desired bound on $\mathrm{OPT} - \sum_{t=1}^{\tau} \tilde{\boldsymbol{\mu}}_t(a_t)^\top \mathbf{x}_t(a_t)$. This part
is similar to the online stochastic packing problem; if the actual reward and consumption vectors were
$\tilde{\boldsymbol{\mu}}_t(a_t)^\top \mathbf{x}_t(a_t)$ and $\tilde{W}_t(a_t)^\top \mathbf{x}_t(a_t)$, then it would be exactly like that problem. We adapt techniques from
[4]: use the OCO algorithm and the $Z$ parameter to combine constraints into the objective. If a dimension
is being consumed too fast, then the multiplier for that dimension should increase, making the algorithm
pick arms that are not likely to consume too much along this dimension.


3.3 Algorithm with Z computation 

In this section, we present a modification of Algorithm 1 which computes the required parameter $Z$ and
therefore does not need to be provided with a $Z$ as input, as assumed previously in Assumption 1. The
algorithm computes $Z$ using the observations from the first $T_0$ rounds. Once $Z$ is computed, the algorithm from
the previous section can be run for the remaining time steps. However, it needs to be modified slightly to
take into account the budget consumed during the first $T_0$ rounds. We handle this by using a smaller budget
$B' = B - T_0$ in the computations for the remaining rounds. The modified algorithm is given below.


Algorithm 2 Algorithm for linCBwK, with Z computation
Inputs: $B$, $T_0$, $B' = B - T_0$

Using observations from the first $T_0$ rounds, compute $Z$ such that $\frac{\mathrm{OPT}}{B'} \le Z \le O\left(\frac{\mathrm{OPT}}{B'} + 1\right)$.
Run Algorithm 1 for $T - T_0$ rounds and budget $B'$.
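Next, we provide details of the first $T_0$ rounds, and the choice of $T_0$.

We provide a method that takes advantage of the linear structure of the problem, and explores in the $m$-dimensional
space of contexts and weight vectors to obtain bounds independent of $K$. We use the following
procedure. In every round $t = 1, \ldots, T_0$, after observing $X_t$, let $\mathbf{p}_t \in \Delta^{[K]}$ be

$$\mathbf{p}_t := \arg\max_{\mathbf{p} \in \Delta^{[K]}} \|X_t \mathbf{p}\|_{M_t^{-1}}, \tag{9}$$

$$\text{where } M_t := I + \sum_{i=1}^{t-1} (X_i \mathbf{p}_i)(X_i \mathbf{p}_i)^\top. \tag{10}$$

Select arm $a_t = a$ with probability $p_t(a)$. In fact, since $M_t$ is a PSD matrix, due to convexity of the function
$\|X_t \mathbf{p}\|_{M_t^{-1}}$, it is the same as playing $a_t = \arg\max_{a \in [K]} \|\mathbf{x}_t(a)\|_{M_t^{-1}}$. Construct estimates $\hat{\boldsymbol{\mu}}_t, \hat{W}_t$ of $\boldsymbol{\mu}_*, W_*$ at
time $t$ as

$$\hat{\boldsymbol{\mu}}_t := M_t^{-1} \sum_{i=1}^{t-1} (X_i \mathbf{p}_i)\, r_i(a_i), \qquad \hat{W}_t := M_t^{-1} \sum_{i=1}^{t-1} (X_i \mathbf{p}_i)\, \mathbf{v}_i(a_i)^\top.$$

And, for some value of $\gamma$ defined later, obtain an estimate $\widehat{\mathrm{OPT}}^\gamma$ of OPT as:

$$\widehat{\mathrm{OPT}}^\gamma := \max_{\pi} \; \frac{T}{T_0} \sum_{i=1}^{T_0} \hat{\boldsymbol{\mu}}_i^\top X_i \pi(X_i) \quad \text{such that} \quad \frac{T}{T_0} \sum_{i=1}^{T_0} \hat{W}_i^\top X_i \pi(X_i) \le B + \gamma.$$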

For an intuition about the choice of arm in (9), observe from the discussion in Section 2.1 that every
column $\mathbf{w}_{*j}$ of $W_*$ is guaranteed to lie inside the confidence ellipsoid centered at column $\hat{\mathbf{w}}_{tj}$ of $\hat{W}_t$, namely
the ellipsoid $\|\mathbf{w} - \hat{\mathbf{w}}_{tj}\|^2_{M_t} \le 4m \ln(Tm/\delta)$. Note that this ellipsoid has principal axes given by the eigenvectors of
$M_t$, and the lengths of the semi-principal axes are determined by the inverse eigenvalues of $M_t$. Therefore, by maximizing
$\|X_t \mathbf{p}\|_{M_t^{-1}}$ we are choosing the context closest to the direction of the longest principal axis of the confidence
ellipsoid, i.e., in the direction of maximum uncertainty. Intuitively, this corresponds to pure exploration:
by making an observation in the direction where uncertainty is large we can reduce the uncertainty in our
estimate most effectively.
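
Since the maximizer in (9) can be taken at a single arm, the exploration rule amounts to scanning the $K$ contexts and picking the one with the largest $M_t^{-1}$-norm. A minimal sketch of that scan follows; the function name `explore_arm` is hypothetical and `Minv` is assumed to hold the current $M_t^{-1}$.

```python
def explore_arm(contexts, Minv):
    """Return the index a maximizing ||x_t(a)||_{M^{-1}} (the squared norm has the same argmax)."""
    def inv_norm_sq(x):
        Mx = [sum(a * xi for a, xi in zip(row, x)) for row in Minv]
        return sum(xi * v for xi, v in zip(x, Mx))
    return max(range(len(contexts)), key=lambda a: inv_norm_sq(contexts[a]))

contexts = [[1.0, 0.0], [0.0, 1.0]]
# Second direction already well explored (small inverse eigenvalue): explore the first.
first_choice = explore_arm(contexts, [[1.0, 0.0], [0.0, 0.25]])
# First direction already well explored: explore the second.
second_choice = explore_arm(contexts, [[0.1, 0.0], [0.0, 1.0]])
```

The cost per round is $O(Km^2)$, independent of how the contexts were generated.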

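A more algebraic explanation is as follows. For a good estimation of OPT by $\widehat{\mathrm{OPT}}^\gamma$, we want the
estimates $\hat{W}_t$ and $W_*$ (and, $\hat{\boldsymbol{\mu}}_t$ and $\boldsymbol{\mu}_*$) to be close enough so that $\left\|\sum_{t=1}^{T_0} (\hat{W}_t - W_*)^\top X_t \pi(X_t)\right\|_\infty$ (and,
$\left|\sum_{t=1}^{T_0} (\hat{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_*)^\top X_t \pi(X_t)\right|$) is small for all policies $\pi$, and in particular for sample optimal policies. Now,
using Cauchy-Schwarz these are bounded by

$$\sum_{t=1}^{T_0} \|\hat{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_*\|_{M_t} \|X_t \pi(X_t)\|_{M_t^{-1}}, \quad \text{and}$$

$$\sum_{t=1}^{T_0} \|\hat{W}_t - W_*\|_{M_t} \|X_t \pi(X_t)\|_{M_t^{-1}},$$

where we define $\|W\|_M$, the $M$-norm of matrix $W$, to be the max of column-wise $M$-norms. Using Lemma 2,
the term $\|\hat{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_*\|_{M_t}$ is bounded by $2\sqrt{m \ln(T_0 m/\delta)}$, and $\|\hat{W}_t - W_*\|_{M_t}$ is bounded by $2\sqrt{m \ln(T_0 m d/\delta)}$,
with probability $1 - \delta$. Lemma 3 bounds the second term $\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}}$ but only when $\pi$ is the
played policy. This is where we use that the played policy $\mathbf{p}_t$ was chosen to maximize $\|X_t \mathbf{p}_t\|_{M_t^{-1}}$, so that
$\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}} \le \sum_{t=1}^{T_0} \|X_t \mathbf{p}_t\|_{M_t^{-1}}$, and the bound $\sum_{t=1}^{T_0} \|X_t \mathbf{p}_t\|_{M_t^{-1}} \le \sqrt{m T_0 \ln(T_0)}$ given by Lemma 3
actually bounds $\sum_{t=1}^{T_0} \|X_t \pi(X_t)\|_{M_t^{-1}}$ for all $\pi$. Combining, we get a bound of $2m\sqrt{T_0 \ln(T_0) \ln(T_0 d/\delta)}$ on
deviations $\left\|\sum_{t=1}^{T_0} (\hat{W}_t - W_*)^\top X_t \pi(X_t)\right\|_\infty$ and $\left|\sum_{t=1}^{T_0} (\hat{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_*)^\top X_t \pi(X_t)\right|$ for all $\pi$.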

We prove the following lemma. 

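Lemma 5. For $\gamma = \left(\frac{T}{T_0}\right) 2m \sqrt{T_0 \ln(T_0) \ln(T_0 d/\delta)}$, with probability $1 - O(\delta)$,

$$\mathrm{OPT} - 2\gamma \le \widehat{\mathrm{OPT}}^{2\gamma} \le \mathrm{OPT} + 9\gamma\left(\frac{\mathrm{OPT}}{B} + 1\right).$$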

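Corollary 3. Set $Z = \frac{\widehat{\mathrm{OPT}}^{2\gamma} + 2\gamma}{B} + 1$, with the above value of $\gamma$. Then, with probability $1 - O(\delta)$,

$$\frac{\mathrm{OPT}}{B} + 1 \le Z \le \left(1 + \frac{11\gamma}{B}\right)\left(\frac{\mathrm{OPT}}{B} + 1\right).$$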
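
The two-sided bound on $Z$ in Corollary 3 follows from Lemma 5 by direct substitution. The derivation below is a sketch worked out here from the inequalities $\mathrm{OPT} - 2\gamma \le \widehat{\mathrm{OPT}}^{2\gamma} \le \mathrm{OPT} + 9\gamma(\mathrm{OPT}/B + 1)$ and the definition $Z = (\widehat{\mathrm{OPT}}^{2\gamma} + 2\gamma)/B + 1$; the constant 11 is what this calculation yields and is stated under that assumption.

```latex
\begin{align*}
Z &= \frac{\widehat{\mathrm{OPT}}^{2\gamma} + 2\gamma}{B} + 1
   \;\ge\; \frac{(\mathrm{OPT} - 2\gamma) + 2\gamma}{B} + 1
   \;=\; \frac{\mathrm{OPT}}{B} + 1, \\
Z &\le \frac{\mathrm{OPT} + 9\gamma\left(\frac{\mathrm{OPT}}{B} + 1\right) + 2\gamma}{B} + 1
   \;=\; \left(\frac{\mathrm{OPT}}{B} + 1\right) + \frac{9\gamma}{B}\left(\frac{\mathrm{OPT}}{B} + 1\right) + \frac{2\gamma}{B} \\
  &\le \left(1 + \frac{11\gamma}{B}\right)\left(\frac{\mathrm{OPT}}{B} + 1\right),
   \qquad \text{using } \frac{2\gamma}{B} \le \frac{2\gamma}{B}\left(\frac{\mathrm{OPT}}{B} + 1\right).
\end{align*}
```

In particular, when $B \ge \gamma$ the multiplicative factor is at most 12, so $Z$ is within a constant factor of $\mathrm{OPT}/B + 1$.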











Corollary 3 implies that as long as $B \ge \gamma$, i.e., $B \ge \tilde{\Omega}\left(\frac{mT}{\sqrt{T_0}}\right)$, $Z$ is a constant factor approximation of
$\frac{\mathrm{OPT}}{B} + 1 \ge Z^*$, therefore Theorem 2 should provide an $\tilde{O}\left(\left(\frac{\mathrm{OPT}}{B} + 1\right) m\sqrt{T}\right)$ regret bound. However, this
bound does not account for the budget consumed in the first $T_0$ rounds. Considering that (at most) $T_0$
amount can be consumed from the budget in the first $T_0$ rounds, we have an additional regret of $\frac{\mathrm{OPT}}{B} T_0$.
Further, since we have $B' = B - T_0$ budget for the remaining $T - T_0$ rounds, we need a $Z$ that satisfies the
required assumption for $B'$ instead of $B$ (i.e., we need $\frac{\mathrm{OPT}}{B'} \le Z \le O(1)\left(\frac{\mathrm{OPT}}{B'} + 1\right)$). If $B \ge 2T_0$, then
$B' \ge B/2$, and using 2 times the $Z$ computed in Corollary 3 would satisfy the required assumption.
Together, these observations give Theorem 3.

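Theorem 3. Using Algorithm 2 with $T_0$ such that $B > \max\{2T_0, mT/\sqrt{T_0}\}$, and twice the $Z$ given by
Corollary 3, we get a high probability regret bound of

$$\tilde{O}\left(\left(\frac{\mathrm{OPT}}{B} + 1\right)\left(T_0 + m\sqrt{T}\right)\right).$$

In particular, using $T_0 = \sqrt{T}$, and assuming $B > mT^{3/4}$ gives a regret bound of

$$\tilde{O}\left(\left(\frac{\mathrm{OPT}}{B} + 1\right) m\sqrt{T}\right).$$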
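
The specialization to $T_0 = \sqrt{T}$ can be checked by plugging into the two conditions and the general bound. The short calculation below is a sketch; the side conditions $mT^{1/4} \ge 2$ and $m \ge 1$ are mild assumptions made here to keep the arithmetic clean and are not stated in the text.

```latex
\begin{align*}
\frac{mT}{\sqrt{T_0}} &= \frac{mT}{T^{1/4}} = mT^{3/4}, \\
2T_0 &= 2\sqrt{T} \;\le\; mT^{3/4} \quad \text{whenever } mT^{1/4} \ge 2, \\
T_0 + m\sqrt{T} &= (1 + m)\sqrt{T} \;\le\; 2m\sqrt{T} \quad \text{for } m \ge 1.
\end{align*}
```

So under $B > mT^{3/4}$ both requirements of the theorem hold, and the $T_0$ term is absorbed into the $m\sqrt{T}$ term of the regret bound.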
Appendix


A Concentration Inequalities 

Lemma 6 (Azuma-Hoeffding inequality). If a super-martingale $(Y_t; t \ge 0)$, corresponding to filtration $\mathcal{F}_t$,
satisfies $|Y_t - Y_{t-1}| \le c_t$ for some constant $c_t$, for all $t = 1, \ldots, T$, then for any $a \ge 0$,

$$\Pr(Y_T - Y_0 \ge a) \le e^{-\frac{a^2}{2\sum_{t=1}^{T} c_t^2}}.$$


B Benchmark 

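Proof of Lemma 1. For an instantiation $\omega = (X_t, V_t)_{t=1}^{T}$ of the sequence of inputs, let vector $\mathbf{p}_t^*(\omega) \in \Delta^{K+1}$
denote the distribution over actions (plus no-op) taken by the optimal adaptive policy at time $t$. Then,

$$\overline{\mathrm{OPT}} = \mathbb{E}_{\omega \sim \mathcal{D}^T}\left[\sum_{t=1}^{T} \boldsymbol{\mu}_*^\top X_t \mathbf{p}_t^*(\omega)\right]. \tag{12}$$

Also, since this is a feasible policy,

$$\mathbb{E}_{\omega \sim \mathcal{D}^T}\left[\sum_{t=1}^{T} W_*^\top X_t \mathbf{p}_t^*(\omega)\right] \le B\mathbf{1}. \tag{13}$$

Construct a static context dependent policy $\pi^*$ as follows: for any $X \in [0,1]^{m \times K}$, define

$$\pi^*(X) := \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_\omega[\mathbf{p}_t^*(\omega) \mid X_t = X].$$

Intuitively, $\pi^*(X)_a$ denotes (in hindsight) the probability that the optimal adaptive policy takes an action $a$
when presented with a context $X$, averaged over all time steps. Now, by the definition of $r(\pi)$, $\mathbf{v}(\pi)$, the above
definition of $\pi^*$, and (12), (13),

$$T\, r(\pi^*) = T\, \mathbb{E}_{X \sim \mathcal{D}}[\boldsymbol{\mu}_*^\top X \pi^*(X)] = \mathbb{E}_\omega\left[\sum_{t=1}^{T} \boldsymbol{\mu}_*^\top X_t \mathbf{p}_t^*(\omega)\right] = \overline{\mathrm{OPT}},$$

$$T\, \mathbf{v}(\pi^*) = T\, \mathbb{E}_{X \sim \mathcal{D}}[W_*^\top X \pi^*(X)] = \mathbb{E}_\omega\left[\sum_{t=1}^{T} W_*^\top X_t \mathbf{p}_t^*(\omega)\right] \le B\mathbf{1}.$$

□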


C Hardness of linear AMO 

In this section we show that finding the best linear policy is NP-Hard. The input to the problem is, for each
$t \in [T]$ and each arm $a \in [K]$, a context $\mathbf{x}_t(a) \in [0,1]^m$ and a reward $r_t(a) \in [-1, 1]$. The output is a vector
$\boldsymbol{\theta} \in \mathbb{R}^m$ that maximizes $\sum_t r_t(a_t)$ where

$$a_t = \arg\max_{a \in [K]} \{\mathbf{x}_t(a)^\top \boldsymbol{\theta}\}.$$

We give a reduction from the problem of learning halfspaces with noise [25]. The input to this problem is,
for some integer $n$, for each $i \in [n]$, a vector $\mathbf{z}_i \in [0,1]^m$ and $y_i \in \{-1, +1\}$. The output is a vector $\boldsymbol{\theta} \in \mathbb{R}^m$
that maximizes

$$\sum_{i=1}^{n} \mathrm{sign}(\mathbf{z}_i^\top \boldsymbol{\theta})\, y_i.$$

Given an instance of the problem of learning halfspaces with noise, construct an instance of the linear
AMO as follows. The time horizon is $T = n$, and the number of arms is $K = 2$. For each $t \in [T]$, the context of
the first arm is $\mathbf{x}_t(1) = \mathbf{z}_t$ and its reward is $r_t(1) = y_t$. The context of the second arm is $\mathbf{x}_t(2) = \mathbf{0}$, the all zeroes
vector, and the reward $r_t(2)$ is also 0.

The total reward of a linear policy w.r.t. a vector $\boldsymbol{\theta}$ for this instance is

$$|\{i : \mathrm{sign}(\mathbf{z}_i^\top \boldsymbol{\theta}) = 1, y_i = 1\}| - |\{i : \mathrm{sign}(\mathbf{z}_i^\top \boldsymbol{\theta}) = 1, y_i = -1\}|.$$

It is easy to see that this is an affine transformation of the objective for the problem of learning halfspaces
with noise.
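
The affine relation can be made explicit: if the policy reward is $A$ and the halfspace objective is $H$, then $H = 2A - \sum_i y_i$. The sketch below builds the two-arm instance and checks this identity on a small example. The tie-breaking conventions (that $\mathrm{sign}(0) = -1$ and that ties between arms go to the zero arm) are assumptions made here so the two sides agree on boundary points; all function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(u):
    # Assumption: points on the boundary are labeled -1.
    return 1 if u > 0 else -1

def build_amo_instance(zs, ys):
    """One round per example: arm 0 has context z_t and reward y_t, arm 1 is all zeros."""
    return [([list(z), [0.0] * len(z)], [float(y), 0.0]) for z, y in zip(zs, ys)]

def linear_policy_reward(rounds, theta):
    """Total reward when each round plays argmax_a x_t(a)^T theta (ties go to the zero arm)."""
    total = 0.0
    for contexts, rewards in rounds:
        scores = [dot(x, theta) for x in contexts]
        arm = 0 if scores[0] > scores[1] else 1
        total += rewards[arm]
    return total

def halfspace_objective(zs, ys, theta):
    return sum(sign(dot(z, theta)) * y for z, y in zip(zs, ys))

zs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [1, -1, 1]
theta = [1.0, -1.0]
rounds = build_amo_instance(zs, ys)
A = linear_policy_reward(rounds, theta)
H = halfspace_objective(zs, ys, theta)
```

Because the map from $A$ to $H$ is increasing, a vector $\boldsymbol{\theta}$ maximizes one objective exactly when it maximizes the other, which is what the reduction needs.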


D Confidence ellipsoids 

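Proof of Corollary 1. The following holds with probability $1 - \delta$.

$$\sum_{t=1}^{T} |\tilde{\boldsymbol{\mu}}_t^\top \mathbf{y}_t - \boldsymbol{\mu}_*^\top \mathbf{y}_t| \le \sum_{t=1}^{T} \|\tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_*\|_{M_t} \|\mathbf{y}_t\|_{M_t^{-1}}$$

$$\le \left(\sqrt{m \ln\left(\frac{1 + Tm}{\delta}\right)} + \sqrt{m}\right) \sqrt{mT \ln(T)}.$$

The inequality in the first line is a matrix-norm version of Cauchy-Schwarz (Lemma 7). The inequality
in the second line is due to Lemmas 2 and 3. The corollary follows from multiplying out the two factors in the
second line.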

□ 

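Lemma 7. For any positive definite matrix $M \in \mathbb{R}^{n \times n}$ and any two vectors $\mathbf{a}, \mathbf{b} \in \mathbb{R}^n$, $|\mathbf{a}^\top \mathbf{b}| \le \|\mathbf{a}\|_M \|\mathbf{b}\|_{M^{-1}}$.

Proof. Since $M$ is positive definite, there exists a symmetric matrix $M^{1/2}$ such that $M = M^{1/2} M^{1/2}$. Further, $M^{-1} =
M^{-1/2} M^{-1/2}$, where $M^{-1/2} = (M^{1/2})^{-1}$. Then

$$\|\mathbf{a}^\top M^{1/2}\|^2 = \mathbf{a}^\top M^{1/2} M^{1/2} \mathbf{a} = \mathbf{a}^\top M \mathbf{a} = \|\mathbf{a}\|_M^2.$$

Similarly, $\|M^{-1/2} \mathbf{b}\|^2 = \|\mathbf{b}\|_{M^{-1}}^2$. Now applying Cauchy-Schwarz, we get that

$$|\mathbf{a}^\top \mathbf{b}| = |\mathbf{a}^\top M^{1/2} M^{-1/2} \mathbf{b}| \le \|\mathbf{a}^\top M^{1/2}\| \, \|M^{-1/2} \mathbf{b}\| = \|\mathbf{a}\|_M \|\mathbf{b}\|_{M^{-1}}.$$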

□ 
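The inequality of Lemma 7 can be sanity-checked numerically. The sketch below is illustrative only: the helper names are hypothetical, the test matrix is built as a Gram matrix plus the identity (one convenient way to get a positive definite matrix), and the $M^{-1}$-norm is computed by solving a linear system rather than forming an inverse.

```python
import math
import random

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(M, v):
    return [dot(row, v) for row in M]

def solve(M, b):
    # Solve M y = b by Gaussian elimination with partial pivoting.
    n = len(b)
    A = [list(row) + [bi] for row, bi in zip(M, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(A[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (A[r][n] - tail) / A[r][r]
    return y

def random_positive_definite(n, rng):
    # M = G^T G + I is symmetric positive definite for any square G.
    G = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    return [[sum(G[k][i] * G[k][j] for k in range(n)) + (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def weighted_cauchy_schwarz_gap(M, a, b):
    # Returns ||a||_M * ||b||_{M^{-1}} - |a^T b|, which should be nonnegative.
    norm_a_M = math.sqrt(dot(a, mat_vec(M, a)))
    norm_b_Minv = math.sqrt(dot(b, solve(M, b)))
    return norm_a_M * norm_b_Minv - abs(dot(a, b))
```

Choosing $b = Ma$ makes the gap vanish up to rounding, which shows that the inequality is tight.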

Proof of Corollary 2. Here, the first claim follows simply from the definition of $\tilde W_t(a)$ and the observation that with probability $1 - \delta$, $W_* \in \mathcal{G}_t$. To obtain the second claim, apply Corollary 1 with $\mu_* = w_{*j}$, $y_t = x_t(a_t)$, $\tilde\mu_t = [\tilde W_t(a_t)]_j$ (the $j$-th column of $\tilde W_t(a_t)$), to bound $\big|\sum_t ([\tilde W_t(a_t)]_j - w_{*j})^\top x_t(a_t)\big| \le \sum_t \big|([\tilde W_t(a_t)]_j - w_{*j})^\top x_t(a_t)\big|$ for every $j$, and then take the norm. □

E Appendix for Section 3.2

Proof of Theorem 2. We will use $\mathcal{R}'$ to denote the main term in the regret bound:

$$\mathcal{R}'(T) := O\left(m \sqrt{\ln(mdT/\delta)\, \ln(T)\, T}\right).$$

Let $\tau$ be the stopping time of the algorithm. Let $H_{t-1}$ be the history of plays and observations before time $t$, i.e. $H_{t-1} := \{\theta_s, X_s, a_s, r_s(a_s), v_s(a_s),\ s = 1, \ldots, t-1\}$. Note that $H_{t-1}$ determines $\theta_t, \hat\mu_t, \hat W_t, \mathcal{G}_t$, but it does not determine $X_t, a_t, \tilde W_t$ (since $a_t$ and $\tilde W_t(a)$ depend on the context $X_t$ at time $t$). The proof is in 3 steps:

Step 1: Since $\mathbb{E}[v_t(a_t) \mid X_t, a_t, H_{t-1}] = W_*^\top x_t(a_t)$, we apply Azuma-Hoeffding to get that with probability $1 - \delta$,

$$\Big\|\sum_{t=1}^{\tau} v_t(a_t) - W_*^\top x_t(a_t)\Big\|_\infty \le \mathcal{R}'(T). \tag{14}$$

Similarly, for the reward it is sufficient to lower bound the sum of $\mu_*^\top x_t(a_t)$.


Step 2: From Corollary 2, with probability $1 - \delta$,

$$\Big\|\sum_{t=1}^{\tau} (W_* - \tilde W_t(a_t))^\top x_t(a_t)\Big\|_\infty \le \mathcal{R}'(T). \tag{15}$$

It is therefore sufficient to bound the sum of the vectors $\tilde W_t(a_t)^\top x_t(a_t)$, and similarly for $\tilde\mu_t(a_t)^\top x_t(a_t)$. We use the shorthand notation $\tilde r_t := \tilde\mu_t(a_t)^\top x_t(a_t)$, $\tilde r_{\mathrm{sum}} := \sum_{t=1}^{\tau} \tilde r_t$, $\tilde v_t := \tilde W_t(a_t)^\top x_t(a_t)$ and $\tilde v_{\mathrm{sum}} := \sum_{t=1}^{\tau} \tilde v_t$ for the rest of this proof.


Step 3: The proof is completed by showing that

$$\mathbb{E}[\tilde r_{\mathrm{sum}}] \ge \mathrm{OPT} - Z\mathcal{R}'(T).$$


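Lemma 8.

$$\sum_{t=1}^{\tau} \mathbb{E}[\tilde r_t \mid H_{t-1}] \ge \frac{\tau}{T}\mathrm{OPT} + Z \sum_{t=1}^{\tau} \mathbb{E}\Big[\theta_t \cdot \Big(\tilde v_t - \frac{B}{T}\mathbf{1}\Big) \,\Big|\, H_{t-1}\Big].$$

Proof. Let $r_t^* := \tilde\mu_t(a_t)^\top X_t \pi^*(X_t)$ and $v_t^* := \tilde W_t(a_t)^\top X_t \pi^*(X_t)$. By Corollary 2, with probability $1 - \delta$, we have that $T\,\mathbb{E}_{X_t}[r_t^* \mid H_{t-1}] \ge \mathrm{OPT}$ and $\mathbb{E}_{X_t}[v_t^* \mid H_{t-1}] \le \frac{B}{T}\mathbf{1}$. By the choice made by the algorithm,

$$\begin{aligned}
\tilde r_t - Z(\theta_t \cdot \tilde v_t) &\ge r_t^* - Z(\theta_t \cdot v_t^*), \\
\mathbb{E}_{X_t}[\tilde r_t - Z(\theta_t \cdot \tilde v_t) \mid H_{t-1}] &\ge \mathbb{E}_{X_t}[r_t^* \mid H_{t-1}] - Z\big(\theta_t \cdot \mathbb{E}[v_t^* \mid H_{t-1}]\big) \\
&\ge \frac{1}{T}\mathrm{OPT} - Z\,\theta_t \cdot \frac{B}{T}\mathbf{1}.
\end{aligned}$$

Summing the above inequality for $t = 1$ to $\tau$ gives the lemma statement. □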

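Lemma 9.

$$\sum_{t=1}^{\tau} \theta_t \cdot \Big(\tilde v_t - \frac{B}{T}\mathbf{1}\Big) \ge B - \frac{\tau B}{T} - \mathcal{R}'(T).$$

Proof. Recall that $g_t(\theta_t) = \theta_t \cdot (\tilde v_t - \frac{B}{T}\mathbf{1})$, therefore the LHS in the required inequality is $\sum_{t=1}^{\tau} g_t(\theta_t)$. Let $\theta^* := \arg\max_{\|\theta\|_1 \le 1,\, \theta \ge 0} \sum_{t=1}^{\tau} g_t(\theta)$. We use the regret definition for the OCO algorithm to get that $\sum_{t=1}^{\tau} g_t(\theta_t) \ge \sum_{t=1}^{\tau} g_t(\theta^*) - \mathcal{R}(T)$. Note that from the regret bound given in Lemma 4, $\mathcal{R}(T) \le \mathcal{R}'(T)$.

Case 1: $\tau < T$. This means that $\sum_{t=1}^{\tau} (v_t(a_t) \cdot e_j) \ge B$ for some $j$. Then from (14) and (15), it must be that $\sum_{t=1}^{\tau} (\tilde v_t \cdot e_j) \ge B - \mathcal{R}'(T)$, so that $\sum_{t=1}^{\tau} g_t(\theta^*) \ge \sum_{t=1}^{\tau} g_t(e_j) \ge B - \frac{\tau B}{T} - \mathcal{R}'(T)$.

Case 2: $\tau = T$. In this case, $B - \frac{\tau}{T}B = 0 = \sum_{t=1}^{\tau} g_t(0) \le \sum_{t=1}^{\tau} g_t(\theta^*)$, which completes the proof of the lemma. □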

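Now, we are ready to prove Theorem 2, which states that Algorithm 1 achieves a regret of $Z\mathcal{R}'(T)$.

Proof of Theorem 2. Substituting the inequality from Lemma 9 in Lemma 8, we get

$$\sum_{t=1}^{\tau} \mathbb{E}[\tilde r_t \mid H_{t-1}] \ge \frac{\tau}{T}\mathrm{OPT} + ZB\Big(1 - \frac{\tau}{T}\Big) - Z\mathcal{R}'(T).$$

Also, $Z \ge \frac{\mathrm{OPT}}{B}$. Substituting in the above,

$$\begin{aligned}
\mathbb{E}[\tilde r_{\mathrm{sum}}] = \sum_{t=1}^{\tau} \mathbb{E}[\tilde r_t \mid H_{t-1}] &\ge \frac{\tau}{T}\mathrm{OPT} + \mathrm{OPT}\Big(1 - \frac{\tau}{T}\Big) - Z\mathcal{R}'(T) \\
&\ge \mathrm{OPT} - Z\mathcal{R}'(T).
\end{aligned}$$

From Steps 1 and 2, this implies a lower bound on $\mathbb{E}\big[\sum_{t=1}^{\tau} r_t(a_t)\big]$. The proof is now completed by using Azuma-Hoeffding to bound the actual total reward with high probability. □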
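The substitution step in the proof above can be spelled out as a one-line identity. The following is a worked restatement, assuming only that the stopping time satisfies $\tau \le T$ and that $Z \ge \mathrm{OPT}/B$, as used in the proof; $\mathcal{R}'(T)$ denotes the main regret term.

```latex
\begin{align*}
\Big[\tfrac{\tau}{T}\,\mathrm{OPT} + ZB\big(1 - \tfrac{\tau}{T}\big) - Z\mathcal{R}'(T)\Big]
  - \Big[\mathrm{OPT} - Z\mathcal{R}'(T)\Big]
  &= ZB\big(1 - \tfrac{\tau}{T}\big) - \mathrm{OPT}\big(1 - \tfrac{\tau}{T}\big) \\
  &= \big(1 - \tfrac{\tau}{T}\big)\,(ZB - \mathrm{OPT}) \;\ge\; 0 .
\end{align*}
```

Both factors in the last expression are nonnegative under the stated assumptions, so stopping early ($\tau < T$) never weakens the lower bound: the budget term $ZB(1 - \tau/T)$ compensates for the rounds that were not played.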


F Appendix for Section 3.3


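Proof of Lemma 5.

Let us define an “intermediate sample optimal” as:

$$\overline{\mathrm{OPT}}^{\gamma} := \begin{array}{ll}
\max_{\pi} & \frac{T}{T_0} \sum_{i=1}^{T_0} \mu_*^\top X_i \pi(X_i) \\
\text{such that} & \frac{T}{T_0} \sum_{i=1}^{T_0} W_*^\top X_i \pi(X_i) \le B + \gamma
\end{array} \tag{16}$$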


The above sample optimal knows the parameters $\mu_*, W_*$; the error comes only from approximating the expected value over the context distribution by the average over the observed contexts. We do not actually compute $\overline{\mathrm{OPT}}^{\gamma}$, but will use it for the convenience of proof exposition. The proof involves two steps.

Step 1: Bound $|\overline{\mathrm{OPT}}^{\gamma} - \mathrm{OPT}|$.

Step 2: Bound $|\widehat{\mathrm{OPT}}^{2\gamma} - \overline{\mathrm{OPT}}^{\gamma}|$.


The Step 1 bound can be borrowed from the work on Online Stochastic Convex Programming in [4]. Since $\mu_*, W_*$ are known, there is effectively full information before making the decision: the vectors $[\mu_*^\top x_t(a), W_*^\top x_t(a)]$ can be treated as outcome vectors that are observed for all arms $a$ before choosing the distribution over arms to be played at time $t$, so the setting in [4] applies. In fact, $\mathrm{OPT}^{\gamma}$ as defined by Equation (F.10) in [4], with $A_t = \{[\mu_*^\top x_t(a), W_*^\top x_t(a)],\ a \in [K]\}$, $f$ the identity, and $S = \{v_{-1} \le \frac{B}{T}\mathbf{1}\}$, is the same as $\frac{1}{T}$ times $\overline{\mathrm{OPT}}^{\gamma}$ defined here. Using Lemma F.4 and Lemma F.6 in [4] (with $L = 1$, $Z^* = \mathrm{OPT}/B$), we obtain that for any $\gamma \ge 2m\sqrt{T_0 \ln(T_0) \ln(T_0 d/\delta)}$, with probability $1 - O(\delta)$,

$$\mathrm{OPT} - \gamma \le \overline{\mathrm{OPT}}^{\gamma} \le \mathrm{OPT} + 2\gamma\Big(\frac{\mathrm{OPT}}{B} + 1\Big). \tag{17}$$

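For Step 2, we show that with probability $1 - \delta$, for all $\pi$ and all $\gamma \ge \big(\frac{T}{T_0}\big)\, 2m\sqrt{T_0 \ln(T_0) \ln(T_0 d/\delta)}$,

$$\frac{T}{T_0} \Big|\sum_{i=1}^{T_0} (\hat\mu_i - \mu_*)^\top X_i \pi(X_i)\Big| \le \gamma, \tag{18}$$

$$\frac{T}{T_0} \Big\|\sum_{i=1}^{T_0} (\hat W_i - W_*)^\top X_i \pi(X_i)\Big\|_\infty \le \gamma. \tag{19}$$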

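This is sufficient to prove both the lower and the upper bound on $\widehat{\mathrm{OPT}}^{2\gamma}$ for $\gamma \ge \big(\frac{T}{T_0}\big)\, 2m\sqrt{T_0 \ln(T_0) \ln(T_0 d/\delta)}$. For the lower bound, we can simply use (19) for the optimal policy for $\overline{\mathrm{OPT}}^{\gamma}$, denoted by $\bar\pi$. This implies that (because of the relaxation of the distance constraint by $\gamma$) $\bar\pi$ is a feasible primal solution for $\widehat{\mathrm{OPT}}^{2\gamma}$, and therefore, using (17) and (18),

$$\widehat{\mathrm{OPT}}^{2\gamma} + \gamma \ge \overline{\mathrm{OPT}}^{\gamma} \ge \mathrm{OPT} - \gamma.$$

For the upper bound, we can use (19) for the optimal policy $\hat\pi$ for $\widehat{\mathrm{OPT}}^{2\gamma}$. Then, using (17) and (18),

$$\widehat{\mathrm{OPT}}^{2\gamma} \le \overline{\mathrm{OPT}}^{3\gamma} + \gamma \le \mathrm{OPT} + 6\gamma\Big(\frac{\mathrm{OPT}}{B} + 1\Big) + \gamma.$$

Combining, this proves the desired lemma statement:

$$\mathrm{OPT} - 2\gamma \le \widehat{\mathrm{OPT}}^{2\gamma} \le \mathrm{OPT} + 7\gamma\Big(\frac{\mathrm{OPT}}{B} + 1\Big). \tag{20}$$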

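What remains is to prove the claims in (18) and (19). We show the proof for (19); the proof for (18) is similar. Observe that for any $\pi$,

$$\begin{aligned}
\Big\|\sum_{t=1}^{T_0} (\hat W_t - W_*)^\top X_t \pi(X_t)\Big\|_\infty
&\le \sum_{t=1}^{T_0} \big\|(\hat W_t - W_*)^\top X_t \pi(X_t)\big\|_\infty \\
&\le \sum_{t=1}^{T_0} \|\hat W_t - W_*\|_{M_t}\, \|X_t \pi(X_t)\|_{M_t^{-1}},
\end{aligned}$$

where $\|\hat W_t - W_*\|_{M_t} = \max_j \|\hat w_{tj} - w_{*j}\|_{M_t}$.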

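Now, applying Lemma 2 to every column $\hat w_{tj}$ of $\hat W_t$, we have that with probability $1 - \delta$, for all $t$,

$$\|\hat W_t - W_*\|_{M_t} \le 2\sqrt{m \log(td/\delta)} \le 2\sqrt{m \log(T_0 d/\delta)}.$$

And, by the choice of $p_t$,

$$\|X_t \pi(X_t)\|_{M_t^{-1}} \le \|X_t p_t\|_{M_t^{-1}}.$$

Also, by Lemma 3,

$$\sum_{t=1}^{T_0} \|X_t p_t\|_{M_t^{-1}} \le \sqrt{m T_0 \ln(T_0)}.$$

Therefore, substituting,

$$\begin{aligned}
\Big\|\sum_{t=1}^{T_0} (\hat W_t - W_*)^\top X_t \pi(X_t)\Big\|_\infty
&\le \big(2\sqrt{m \log(T_0 d/\delta)}\big) \sum_{t=1}^{T_0} \|X_t p_t\|_{M_t^{-1}} \\
&\le \big(2\sqrt{m \log(T_0 d/\delta)}\big) \sqrt{m T_0 \ln(T_0)} \\
&\le \frac{T_0}{T}\,\gamma.
\end{aligned}$$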


□ 
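The final chain of inequalities comes down to comparing a product of two factors with the threshold on the relaxation parameter. The sketch below is illustrative only: the function names and the numeric values are hypothetical, and both `log` and `ln` in the text are assumed to be natural logarithms. It evaluates the threshold with the horizon-to-sample ratio included, and the product of the confidence radius with the sum-of-norms bound.

```python
import math

def gamma_threshold(T, T0, m, d, delta):
    # Threshold on the relaxation parameter, with the T/T0 factor included:
    # (T / T0) * 2m * sqrt(T0 * ln(T0) * ln(T0 * d / delta)).
    return (T / T0) * 2 * m * math.sqrt(
        T0 * math.log(T0) * math.log(T0 * d / delta))

def estimation_error_bound(T0, m, d, delta):
    # Confidence radius for the parameter estimate times the bound on the
    # sum of M_t^{-1}-norms over the first T0 rounds.
    radius = 2 * math.sqrt(m * math.log(T0 * d / delta))
    norm_sum = math.sqrt(m * T0 * math.log(T0))
    return radius * norm_sum

if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    T, T0, m, d, delta = 10_000, 100, 5, 3, 0.05
    gamma = gamma_threshold(T, T0, m, d, delta)
    print(estimation_error_bound(T0, m, d, delta), (T0 / T) * gamma)
```

At the threshold the two printed quantities coincide up to rounding, so any larger relaxation parameter satisfies the last inequality with slack.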